BUG: Timedelta not formatted correctly in to_json #28595


Closed
cbertinato wants to merge 2 commits

Conversation

cbertinato (Contributor)

WillAyd added the IO JSON (read_json, to_json, json_normalize) label on Sep 24, 2019
WillAyd (Member) left a comment:

Thanks for the PR


if is_timedelta64_dtype(obj.dtype) and self.date_format == "iso":
    obj = obj.copy()
    self.obj = obj.apply(lambda x: x.isoformat())
WillAyd (Member):

Hmm, not sure about this approach - this doesn't work if the timedelta is contained within an object array, right?

cbertinato (Contributor, Author):

You're right, it won't. But the modifications in objToJSON.c should take care of that case.

cbertinato (Contributor, Author):

Which reminds me, there should probably also be a test for the astype(object) case. I will add.
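
A minimal sketch of what such an object-dtype test might look like, assuming the fix is in place (the sample values and expected ISO strings are illustrative, not the exact test added in this PR):

import json

import pandas as pd
from pandas import Series


def test_timedelta_as_object_to_json():
    # Cast to object dtype so the values reach the serializer as pd.Timedelta objects
    s = Series(pd.to_timedelta(["1 days", "2 days"])).astype(object)
    result = json.loads(s.to_json(date_format="iso"))
    # pd.Timedelta("1 days").isoformat() == "P1DT0H0M0S"
    assert result == {"0": "P1DT0H0M0S", "1": "P2DT0H0M0S"}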

WillAyd (Member):

The question, ultimately, is whether we need to do this at the Python level at all. It seems to negate the need for the change to the extension module, since when ISO format is requested these values would already be strings by the time they reach the serializer, not actual timedelta objects.

FWIW this is going to be rather non-performant, especially for large DataFrames

cbertinato (Contributor, Author):

Well, the Python-level change doesn't entirely negate the need for the change in the extension module.

If one does df.astype(object).to_json(date_format="iso"), then this code in _json.py is not executed and we instead go through the modifications made in objToJSON.c.

So in the case where we pass an object array into the extension module, we can take care of the serialization in C. In the case where we attempt to serialize the DataFrame directly, timedeltas appear differently: they come into the C code as NumPy timedeltas, for which there is no method to serialize as ISO. We could perhaps convert to a pandas Timedelta in the C code and then use its isoformat() method inside the extension module.

I went with this python-level change based on earlier discussions we had about this issue. In particular: "This would mean dispatching to vectorized methods in Python where applicable (i.e. a DatetimeArray is converted to an array of ISO strings before getting to the serializer)..."

We do end up with a less performant result in the case that we directly serialize the DataFrame but may gain in simplicity. The fix for the object dtype case should still be good performance-wise.

That said, I would be amenable to attempting a fix contained entirely within the extension module.
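
For illustration, the two code paths discussed above could be exercised roughly like this (a sketch, not the PR's own test code):

import pandas as pd

df = pd.DataFrame({"td": pd.timedelta_range("1 day", periods=3)})

# timedelta64 dtype: handled by the Python-level change in _json.py,
# which pre-converts the values to ISO strings via isoformat()
df.to_json(date_format="iso")

# object dtype: the values arrive in objToJSON.c as pd.Timedelta objects,
# so the C-level change has to handle the serialization itself
df.astype(object).to_json(date_format="iso")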

WillAyd (Member):

Hmm, OK. Yeah, this is a weird in-between. Maybe OK for now knowing it fixes the bug, and we can worry about performance in a follow-up.

)
def test_series_timedelta_to_json(self, date_format, delta, formatted):
    # GH28156: to_json not correctly formatting Timedelta
    s = Series([pd.Timedelta(d) for d in delta])
WillAyd (Member):

Should be able to use pd.timedelta_range
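
For example, the fixture could presumably be built along these lines, using the same Series/pd imports as the test above (periods and frequency here are placeholders):

s = Series(pd.timedelta_range(start="1 day", periods=2, freq="D"))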

def test_series_timedelta_to_json(self, date_format, delta, formatted):
    # GH28156: to_json not correctly formatting Timedelta
    s = Series([pd.Timedelta(d) for d in delta])
    s.index = ["one", "two"]
WillAyd (Member):

Is modifying the index critical for the test? If not best to exclude

cbertinato (Contributor, Author):

I wouldn't say it's critical. I initially included it because a previous iteration had code that reset the index, which a test like this would catch. But if the changes are sufficiently well-constrained not to touch anything outside of what they should, then changing the index in the test isn't critical.

result = json.loads(s.to_json(date_format=date_format))
expected = {k: v for k, v in zip(s.index, formatted)}

assert result == expected
WillAyd (Member):

Hmm, can you make the test compare against the actual expected output? Right now I'm not sure this actually tests the date format; it could still always write out in epoch and just infer that back on the way in, and the test would still pass.
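
One way to do that would be to assert against the serialized string itself rather than round-tripping through json.loads; a rough sketch, with the expected literal assumed for a two-element daily Timedelta series:

s = Series(pd.to_timedelta(["1 days", "2 days"]))
result = s.to_json(date_format="iso")
# With ISO formatting in effect the raw output should contain duration strings,
# not epoch integers (exact expected value assumed for illustration)
assert result == '{"0":"P1DT0H0M0S","1":"P2DT0H0M0S"}'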

cbertinato (Contributor, Author):

Fixed.

WillAyd (Member) left a comment:

Could you add a benchmark that contains time deltas to track performance?
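
A plausible asv-style benchmark for this might look roughly as follows (the class, parameters, and data shapes are illustrative, not the ones ultimately added to the pandas benchmarks):

import numpy as np
import pandas as pd


class ToJSONTimedelta:
    params = [["split", "columns", "index", "values", "records"]]
    param_names = ["orient"]

    def setup(self, orient):
        N = 10 ** 5
        self.df = pd.DataFrame(
            {"td": pd.to_timedelta(np.random.randint(1, 10 ** 6, N), unit="s")}
        )

    def time_to_json_iso(self, orient):
        self.df.to_json(date_format="iso", orient=orient)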


@@ -171,6 +171,34 @@ def _write(
class SeriesWriter(Writer):
    _default_orient = "index"

    def __init__(
WillAyd (Member):

Can you annotate these signatures?
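
For reference, the kind of annotations being asked for might look roughly like this (the parameter list is a guess based on to_json's keyword arguments, not the exact signature from the PR):

from typing import Callable, Optional


class SeriesWriter:  # in pandas this subclasses the Writer base class
    _default_orient = "index"

    def __init__(
        self,
        obj,
        orient: Optional[str],
        date_format: str,
        double_precision: int,
        ensure_ascii: bool,
        date_unit: str,
        index: bool,
        default_handler: Optional[Callable] = None,
        indent: int = 0,
    ) -> None:
        self.obj = obj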

@pep8speaks

pep8speaks commented Sep 26, 2019

Hello @cbertinato! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-10-05 16:30:59 UTC

cbertinato (Contributor, Author) commented Sep 26, 2019

There was an existing DataFrame that mixed timedeltas, datetimes, and ints, so just to be safe I added another with only timedeltas. I also added a benchmark with date_format="iso". Just about all DataFrames containing timedeltas showed a significant slowdown:

       before           after         ratio
     [4fb853f6]       [aa9749a2]
     <28256-iso-date-json~2>       <28256-iso-date-json>
+        71.4±2ms       3.55±0.07s    49.76  io.json.ToJSON.time_to_json_iso('records', 'df_td')
+        68.8±1ms       3.40±0.06s    49.40  io.json.ToJSON.time_to_json_iso('values', 'df_td')
+        93.4±5ms       3.53±0.06s    37.81  io.json.ToJSON.time_to_json_iso('split', 'df_td')
+         179±7ms       3.79±0.03s    21.14  io.json.ToJSON.time_to_json_iso('split', 'df_td_int_ts')
+        183±20ms       3.50±0.05s    19.11  io.json.ToJSON.time_to_json_iso('values', 'df_td_int_ts')
+         184±9ms       3.43±0.08s    18.62  io.json.ToJSON.time_to_json_iso('records', 'df_td_int_ts')
+        363±10ms       3.70±0.05s    10.17  io.json.ToJSON.time_to_json_iso('columns', 'df_td')
+        375±10ms       3.70±0.05s     9.86  io.json.ToJSON.time_to_json_iso('index', 'df_td')
+        482±20ms       3.92±0.08s     8.13  io.json.ToJSON.time_to_json_iso('index', 'df_td_int_ts')
+         483±9ms       3.85±0.03s     7.96  io.json.ToJSON.time_to_json_iso('columns', 'df_td_int_ts')
+        116±10ms         482±30ms     4.16  io.json.ToJSON.time_to_json_iso('split', 'df_int_floats')
+        85.8±5ms         336±40ms     3.91  io.json.ToJSON.time_to_json_iso('values', 'df_int_floats')
+        120±10ms         455±30ms     3.80  io.json.ToJSON.time_to_json_iso('records', 'df_int_floats')
+        410±10ms         976±20ms     2.38  io.json.ToJSON.time_to_json_iso('index', 'df_int_floats')
+        420±10ms         982±20ms     2.34  io.json.ToJSON.time_to_json_iso('columns', 'df_int_floats')

Interestingly, there are some significant changes for int-float DataFrames as well.

cbertinato (Contributor, Author):

I tried instead converting to a pandas Timedelta in C, which, performance-wise, is less bad than converting in Python, but still not that great:

       before           after         ratio
     [6396bc37]       [b97525a0]
     <master>         <28256-iso-date-json>
+        72.0±1ms       2.28±0.05s    31.61  io.json.ToJSON.time_to_json_iso('records', 'df_td')
+        68.3±4ms       2.14±0.01s    31.27  io.json.ToJSON.time_to_json_iso('values', 'df_td')
+        87.7±3ms       2.19±0.03s    25.01  io.json.ToJSON.time_to_json_iso('split', 'df_td')
+         148±2ms       2.30±0.01s    15.55  io.json.ToJSON.time_to_json_iso('values', 'df_td_int_ts')
+         174±7ms       2.37±0.04s    13.65  io.json.ToJSON.time_to_json_iso('split', 'df_td_int_ts')
+        179±10ms       2.40±0.01s    13.40  io.json.ToJSON.time_to_json_iso('records', 'df_td_int_ts')
+         351±9ms       2.42±0.04s     6.92  io.json.ToJSON.time_to_json_iso('columns', 'df_td')
+        356±10ms       2.45±0.01s     6.87  io.json.ToJSON.time_to_json_iso('index', 'df_td')
+         449±7ms       2.65±0.07s     5.89  io.json.ToJSON.time_to_json_iso('index', 'df_td_int_ts')
+         471±8ms       2.50±0.01s     5.30  io.json.ToJSON.time_to_json_iso('columns', 'df_td_int_ts')

I wonder whether we can ever do much better than this without a function that formats NumPy timedeltas directly.

WillAyd mentioned this pull request on Oct 9, 2019
jreback (Contributor) commented Nov 13, 2019

@cbertinato what is the status of this?

cbertinato (Contributor, Author):

Unless I'm missing another way to solve this, it looks like we take a performance hit. So I suppose the question is whether this is tolerable for this fix, or whether we should scrap this branch and go with a vectorized solution, something like Will's work in #28863.

jreback (Contributor) commented Nov 13, 2019

Instead of this approach, I would convert the timedeltas to longs (using vectorized methods on the Series) before passing them to the C-based dumper.
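
A rough sketch of that suggestion, assuming the "longs" are the underlying int64 nanosecond counts (the exact representation isn't specified here):

import pandas as pd

s = pd.Series(pd.timedelta_range("1 day", periods=3))

# Vectorized conversion on the Series before anything reaches the C serializer:
# timedelta64[ns] values converted to their underlying int64 nanosecond counts.
as_longs = s.astype("int64")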

WillAyd (Member) commented Dec 17, 2019

I think this should be easier to fix after #30283. I think it will just need a new timedelta-to-ISO method, though.

cbertinato (Contributor, Author):

In C?

WillAyd (Member) commented Dec 17, 2019

Right, I think it would be easier to do in C since the datetime code is already there.

cbertinato (Contributor, Author):

Sorry for the severe lag on this. Will pick it back up now that #30283 is in.

WillAyd (Member) commented Jan 10, 2020

@cbertinato I think I'm close on a C implementation, actually. Might be able to push it up soon to compare.

WillAyd (Member) commented Feb 12, 2020

Thanks for the PR! I think this will be superseded by #30903, though, so closing for now.

WillAyd closed this on Feb 12, 2020
Labels: IO JSON (read_json, to_json, json_normalize)

Development

Successfully merging this pull request may close these issues:

Timedelta in to_json object array and ISO dates not handled properly

4 participants